Introduction

For this mini project, the dataset was taken from Kaggle. The Indian Premier League (IPL) is a professional Twenty20 cricket league in India, typically held between March and May each year. It features eight to ten teams representing various cities or states across India. Established by the Board of Control for Cricket in India (BCCI) in 2007, the IPL is the world’s most-attended cricket league and has a significant brand value.

Clear all objects from the environment

rm(list=ls())

Loading required libraries

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tibble)
library(ggplot2)
library(readr)
library(sf)
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Loading required package: lattice
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Importing data

matches <- read_csv("C:/Users/Administrator/Desktop/SUMMER_A_2024/Data_Visualisations/Siri_kesidi_IPL_Decoding_Data_viz_mini_2/Siri_MP_2_data/matches.csv")
## Rows: 756 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): city, date, team1, team2, toss_winner, toss_decision, result, winn...
## dbl  (5): id, season, dl_applied, win_by_runs, win_by_wickets
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
deliveries <- read_csv("C:/Users/Administrator/Desktop/SUMMER_A_2024/Data_Visualisations/Siri_kesidi_IPL_Decoding_Data_viz_mini_2/Siri_MP_2_data/deliveries.csv")
## Rows: 179078 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): batting_team, bowling_team, batsman, non_striker, bowler, player_d...
## dbl (13): match_id, inning, over, ball, is_super_over, wide_runs, bye_runs, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Description

The matches data set (756 rows, 18 columns) has one row per IPL match. It records the match ID, season, city, date, the two teams, the toss winner and toss decision, the result, whether the Duckworth-Lewis method was applied (for example after a weather delay), the winner, the victory margin in runs or wickets, the player of the match, the venue, and the umpires. Most of these fields are stored as character data; the ID, season, Duckworth-Lewis flag, and victory margins are numeric. The deliveries data set (179,078 rows, 21 columns) has one row per ball and records cricket-specific details such as the innings, batting and bowling teams, over, ball, batsman, bowler, runs scored, extra runs, and player dismissals.

summary(matches)
##        id              season         city               date          
##  Min.   :    1.0   Min.   :2008   Length:756         Length:756        
##  1st Qu.:  189.8   1st Qu.:2011   Class :character   Class :character  
##  Median :  378.5   Median :2013   Mode  :character   Mode  :character  
##  Mean   : 1792.2   Mean   :2013                                        
##  3rd Qu.:  567.2   3rd Qu.:2016                                        
##  Max.   :11415.0   Max.   :2019                                        
##     team1              team2           toss_winner        toss_decision     
##  Length:756         Length:756         Length:756         Length:756        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     result            dl_applied         winner           win_by_runs    
##  Length:756         Min.   :0.00000   Length:756         Min.   :  0.00  
##  Class :character   1st Qu.:0.00000   Class :character   1st Qu.:  0.00  
##  Mode  :character   Median :0.00000   Mode  :character   Median :  0.00  
##                     Mean   :0.02513                      Mean   : 13.28  
##                     3rd Qu.:0.00000                      3rd Qu.: 19.00  
##                     Max.   :1.00000                      Max.   :146.00  
##  win_by_wickets   player_of_match       venue             umpire1         
##  Min.   : 0.000   Length:756         Length:756         Length:756        
##  1st Qu.: 0.000   Class :character   Class :character   Class :character  
##  Median : 4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 3.351                                                           
##  3rd Qu.: 6.000                                                           
##  Max.   :10.000                                                           
##    umpire2            umpire3         
##  Length:756         Length:756        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
summary(deliveries)
##     match_id         inning      batting_team       bowling_team      
##  Min.   :    1   Min.   :1.000   Length:179078      Length:179078     
##  1st Qu.:  190   1st Qu.:1.000   Class :character   Class :character  
##  Median :  379   Median :1.000   Mode  :character   Mode  :character  
##  Mean   : 1802   Mean   :1.483                                        
##  3rd Qu.:  567   3rd Qu.:2.000                                        
##  Max.   :11415   Max.   :5.000                                        
##       over            ball         batsman          non_striker       
##  Min.   : 1.00   Min.   :1.000   Length:179078      Length:179078     
##  1st Qu.: 5.00   1st Qu.:2.000   Class :character   Class :character  
##  Median :10.00   Median :4.000   Mode  :character   Mode  :character  
##  Mean   :10.16   Mean   :3.616                                        
##  3rd Qu.:15.00   3rd Qu.:5.000                                        
##  Max.   :20.00   Max.   :9.000                                        
##     bowler          is_super_over         wide_runs          bye_runs       
##  Length:179078      Min.   :0.0000000   Min.   :0.00000   Min.   :0.000000  
##  Class :character   1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Mode  :character   Median :0.0000000   Median :0.00000   Median :0.000000  
##                     Mean   :0.0004523   Mean   :0.03672   Mean   :0.004936  
##                     3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##                     Max.   :1.0000000   Max.   :5.00000   Max.   :4.000000  
##   legbye_runs       noball_runs        penalty_runs      batsman_runs  
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.0e+00   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0e+00   1st Qu.:0.000  
##  Median :0.00000   Median :0.000000   Median :0.0e+00   Median :1.000  
##  Mean   :0.02114   Mean   :0.004183   Mean   :5.6e-05   Mean   :1.247  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.0e+00   3rd Qu.:1.000  
##  Max.   :5.00000   Max.   :5.000000   Max.   :5.0e+00   Max.   :7.000  
##    extra_runs        total_runs     player_dismissed   dismissal_kind    
##  Min.   :0.00000   Min.   : 0.000   Length:179078      Length:179078     
##  1st Qu.:0.00000   1st Qu.: 0.000   Class :character   Class :character  
##  Median :0.00000   Median : 1.000   Mode  :character   Mode  :character  
##  Mean   :0.06703   Mean   : 1.314                                        
##  3rd Qu.:0.00000   3rd Qu.: 1.000                                        
##  Max.   :7.00000   Max.   :10.000                                        
##    fielder         
##  Length:179078     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Structure of the data

str(deliveries)
## spc_tbl_ [179,078 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ match_id        : num [1:179078] 1 1 1 1 1 1 1 1 1 1 ...
##  $ inning          : num [1:179078] 1 1 1 1 1 1 1 1 1 1 ...
##  $ batting_team    : chr [1:179078] "Sunrisers Hyderabad" "Sunrisers Hyderabad" "Sunrisers Hyderabad" "Sunrisers Hyderabad" ...
##  $ bowling_team    : chr [1:179078] "Royal Challengers Bangalore" "Royal Challengers Bangalore" "Royal Challengers Bangalore" "Royal Challengers Bangalore" ...
##  $ over            : num [1:179078] 1 1 1 1 1 1 1 2 2 2 ...
##  $ ball            : num [1:179078] 1 2 3 4 5 6 7 1 2 3 ...
##  $ batsman         : chr [1:179078] "DA Warner" "DA Warner" "DA Warner" "DA Warner" ...
##  $ non_striker     : chr [1:179078] "S Dhawan" "S Dhawan" "S Dhawan" "S Dhawan" ...
##  $ bowler          : chr [1:179078] "TS Mills" "TS Mills" "TS Mills" "TS Mills" ...
##  $ is_super_over   : num [1:179078] 0 0 0 0 0 0 0 0 0 0 ...
##  $ wide_runs       : num [1:179078] 0 0 0 0 2 0 0 0 0 0 ...
##  $ bye_runs        : num [1:179078] 0 0 0 0 0 0 0 0 0 0 ...
##  $ legbye_runs     : num [1:179078] 0 0 0 0 0 0 1 0 0 0 ...
##  $ noball_runs     : num [1:179078] 0 0 0 0 0 0 0 0 0 1 ...
##  $ penalty_runs    : num [1:179078] 0 0 0 0 0 0 0 0 0 0 ...
##  $ batsman_runs    : num [1:179078] 0 0 4 0 0 0 0 1 4 0 ...
##  $ extra_runs      : num [1:179078] 0 0 0 0 2 0 1 0 0 1 ...
##  $ total_runs      : num [1:179078] 0 0 4 0 2 0 1 1 4 1 ...
##  $ player_dismissed: chr [1:179078] NA NA NA NA ...
##  $ dismissal_kind  : chr [1:179078] NA NA NA NA ...
##  $ fielder         : chr [1:179078] NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   match_id = col_double(),
##   ..   inning = col_double(),
##   ..   batting_team = col_character(),
##   ..   bowling_team = col_character(),
##   ..   over = col_double(),
##   ..   ball = col_double(),
##   ..   batsman = col_character(),
##   ..   non_striker = col_character(),
##   ..   bowler = col_character(),
##   ..   is_super_over = col_double(),
##   ..   wide_runs = col_double(),
##   ..   bye_runs = col_double(),
##   ..   legbye_runs = col_double(),
##   ..   noball_runs = col_double(),
##   ..   penalty_runs = col_double(),
##   ..   batsman_runs = col_double(),
##   ..   extra_runs = col_double(),
##   ..   total_runs = col_double(),
##   ..   player_dismissed = col_character(),
##   ..   dismissal_kind = col_character(),
##   ..   fielder = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(matches)
## spc_tbl_ [756 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id             : num [1:756] 1 2 3 4 5 6 7 8 9 10 ...
##  $ season         : num [1:756] 2017 2017 2017 2017 2017 ...
##  $ city           : chr [1:756] "Hyderabad" "Pune" "Rajkot" "Indore" ...
##  $ date           : chr [1:756] "2017-04-05" "2017-04-06" "2017-04-07" "2017-04-08" ...
##  $ team1          : chr [1:756] "Sunrisers Hyderabad" "Mumbai Indians" "Gujarat Lions" "Rising Pune Supergiant" ...
##  $ team2          : chr [1:756] "Royal Challengers Bangalore" "Rising Pune Supergiant" "Kolkata Knight Riders" "Kings XI Punjab" ...
##  $ toss_winner    : chr [1:756] "Royal Challengers Bangalore" "Rising Pune Supergiant" "Kolkata Knight Riders" "Kings XI Punjab" ...
##  $ toss_decision  : chr [1:756] "field" "field" "field" "field" ...
##  $ result         : chr [1:756] "normal" "normal" "normal" "normal" ...
##  $ dl_applied     : num [1:756] 0 0 0 0 0 0 0 0 0 0 ...
##  $ winner         : chr [1:756] "Sunrisers Hyderabad" "Rising Pune Supergiant" "Kolkata Knight Riders" "Kings XI Punjab" ...
##  $ win_by_runs    : num [1:756] 35 0 0 0 15 0 0 0 97 0 ...
##  $ win_by_wickets : num [1:756] 0 7 10 6 0 9 4 8 0 4 ...
##  $ player_of_match: chr [1:756] "Yuvraj Singh" "SPD Smith" "CA Lynn" "GJ Maxwell" ...
##  $ venue          : chr [1:756] "Rajiv Gandhi International Stadium, Uppal" "Maharashtra Cricket Association Stadium" "Saurashtra Cricket Association Stadium" "Holkar Cricket Stadium" ...
##  $ umpire1        : chr [1:756] "AY Dandekar" "A Nand Kishore" "Nitin Menon" "AK Chaudhary" ...
##  $ umpire2        : chr [1:756] "NJ Llong" "S Ravi" "CK Nandan" "C Shamshuddin" ...
##  $ umpire3        : chr [1:756] NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   season = col_double(),
##   ..   city = col_character(),
##   ..   date = col_character(),
##   ..   team1 = col_character(),
##   ..   team2 = col_character(),
##   ..   toss_winner = col_character(),
##   ..   toss_decision = col_character(),
##   ..   result = col_character(),
##   ..   dl_applied = col_double(),
##   ..   winner = col_character(),
##   ..   win_by_runs = col_double(),
##   ..   win_by_wickets = col_double(),
##   ..   player_of_match = col_character(),
##   ..   venue = col_character(),
##   ..   umpire1 = col_character(),
##   ..   umpire2 = col_character(),
##   ..   umpire3 = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Show the total runs scored in each over by each team, allowing a comparison of their scoring across overs.

# Calculate total runs per over per team
team_runs_per_over <- deliveries %>%
  group_by(batting_team, over) %>%
  summarise(total_runs = sum(total_runs)) %>%
  ungroup()
## `summarise()` has grouped output by 'batting_team'. You can override using the
## `.groups` argument.
team_runs_per_over
## # A tibble: 300 × 3
##    batting_team         over total_runs
##    <chr>               <dbl>      <dbl>
##  1 Chennai Super Kings     1        870
##  2 Chennai Super Kings     2       1116
##  3 Chennai Super Kings     3       1293
##  4 Chennai Super Kings     4       1354
##  5 Chennai Super Kings     5       1423
##  6 Chennai Super Kings     6       1447
##  7 Chennai Super Kings     7       1159
##  8 Chennai Super Kings     8       1186
##  9 Chennai Super Kings     9       1246
## 10 Chennai Super Kings    10       1165
## # ℹ 290 more rows
# Create a static line plot using ggplot2 with facets
line_plot <- ggplot(team_runs_per_over, aes(x = over, y = total_runs, color = batting_team)) +
  geom_line() +
  facet_wrap(~ batting_team, scales = "free_y") +
  labs(title = "Total Runs Scored per Over by Teams",
       x = "Over",
       y = "Total Runs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Convert the static line plot to an interactive plot using plotly
interactive_line_plot <- ggplotly(line_plot)

# Save the interactive plot as a self-contained HTML file
htmlwidgets::saveWidget(interactive_line_plot, "team_runs_per_over_line_plot.html")

interactive_line_plot

What were the original charts you planned to create for this assignment? What steps were necessary for cleaning and preparing the data?

The original chart I planned to create for this assignment was a static line plot showing the total runs scored per over by each team, with one facet per batting team; it was then converted to an interactive plot with ggplotly(). To prepare the data, I grouped the deliveries dataset by batting team and over, then summed total_runs for each combination. Preparation was limited to this aggregation; no rows were filtered out.
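Because teams have played different numbers of matches, raw totals per over favour the teams with more innings. A minimal sketch of one alternative, averaging each over's runs across innings, is shown here. The `toy_deliveries` data frame and its values are invented for illustration; only the column names follow the deliveries dataset.

```r
library(dplyr)

# Toy stand-in for `deliveries`; the values are invented for illustration only.
toy_deliveries <- data.frame(
  match_id     = c(1, 1, 1, 2, 2, 1, 1),
  inning       = c(1, 1, 1, 1, 1, 2, 2),
  batting_team = c("Team A", "Team A", "Team A", "Team A", "Team A",
                   "Team B", "Team B"),
  over         = c(1, 1, 2, 1, 2, 1, 2),
  total_runs   = c(4, 2, 6, 0, 2, 1, 3)
)

avg_runs_per_over <- toy_deliveries %>%
  # First total each over within a single innings ...
  group_by(batting_team, match_id, inning, over) %>%
  summarise(over_runs = sum(total_runs), .groups = "drop") %>%
  # ... then average those over totals across innings for each team.
  group_by(batting_team, over) %>%
  summarise(mean_runs = mean(over_runs), innings = n(), .groups = "drop")

avg_runs_per_over
```

Setting `.groups = "drop"` also avoids the grouped-output message that `summarise()` prints otherwise.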

What story could you tell with your plots? What difficulties did you encounter while creating the visualizations? What additional approaches do you think can be used to explore the data you selected?

With the plots, I can illustrate the scoring patterns of each team over the course of an innings, highlighting periods of aggressive scoring or slow accumulation. The static line plot allows for easy comparison between teams, revealing any disparities in scoring rates or consistency. Difficulties encountered included ensuring the readability of the plot with multiple facets and selecting appropriate color schemes for clarity. Additional approaches to explore the data could include analyzing the distribution of runs scored in different overs.
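One way to look at the distribution of runs in different overs is to group overs into phases of an innings and summarise each phase. The sketch below assumes the conventional T20 split (overs 1 to 6 as powerplay, 7 to 15 as middle, 16 to 20 as death); the cut points, the `toy_over_totals` data, and the `phase` column are assumptions made for illustration, not part of the original analysis.

```r
library(dplyr)

# Toy per-innings over totals; the values are invented for illustration only.
toy_over_totals <- data.frame(
  batting_team = rep(c("Team A", "Team B"), each = 6),
  over         = rep(c(2, 5, 8, 12, 17, 20), times = 2),
  over_runs    = c(7, 9, 5, 6, 12, 15, 6, 8, 7, 7, 10, 11)
)

phase_summary <- toy_over_totals %>%
  # Intervals are (0, 6], (6, 15], (15, 20].
  mutate(phase = cut(over,
                     breaks = c(0, 6, 15, 20),
                     labels = c("powerplay", "middle", "death"))) %>%
  group_by(batting_team, phase) %>%
  summarise(median_runs = median(over_runs),
            max_runs    = max(over_runs),
            .groups     = "drop")

phase_summary
```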

How did you apply the principles of data visualizations and design for this assignment?

Clear labels and titles were provided to aid interpretation, and the use of facets allowed for easy comparison between teams. The interactive element enhances engagement by allowing viewers to hover over data points for more information. The minimalist theme and rotated axis labels improve readability, adhering to best practices for effective data visualization.

Spatial visualization showing the locations of IPL matches held in India

indian_cities <- read_csv("C:/Users/Administrator/Desktop/SUMMER_A_2024/Data_Visualisations/Siri_kesidi_IPL_Decoding_Data_viz_mini_2/Siri_MP_2_data/Indian_cities.csv") 
## Rows: 1267 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): State, District, City, Population, Area (in km^2)
## dbl (2): Latitude, Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Merge IPL dataset with geographical data
matches_india <- matches %>%
  left_join(indian_cities, by = c("city" = "District"))
## Warning in left_join(., indian_cities, by = c(city = "District")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 475 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# Plot matches on the map using leaflet
map <- leaflet(data = matches_india) %>%
  addTiles() %>%
  addMarkers(~Longitude, ~Latitude, popup = ~venue)
## Warning in validateCoords(lng, lat, funcName): Data contains 119 rows with
## either missing or invalid lat/lon values and will be ignored
# Display the map
map
# Save the interactive plot as a self-contained HTML file
htmlwidgets::saveWidget(map, "venue_matches_played.html")

What were the original charts you planned to create for this assignment? What steps were necessary for cleaning and preparing the data?

The original chart planned for this assignment was a spatial visualization showing the locations of IPL matches held in India, drawn on a map from a shapefile, but I was not able to find a suitable India .shp file. The necessary steps for cleaning and preparing the data were: loading geographical data for Indian cities, including latitude and longitude coordinates; merging the IPL dataset with the geographical data by matching each match's city to the District column; and plotting the matches on a map using the leaflet package.
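The join on District emits a many-to-many warning because a district can appear on several rows of the cities table, which duplicates matches in the result. A minimal sketch of one way around this is to collapse the lookup table to one row per district before joining. The toy tables and their approximate coordinates are illustrative assumptions, and averaging the coordinates per district is only one possible choice.

```r
library(dplyr)

# Toy stand-ins; the coordinates are approximate and for illustration only.
toy_matches <- data.frame(id = 1:3, city = c("Pune", "Pune", "Indore"))
toy_cities <- data.frame(
  District  = c("Pune", "Pune", "Indore"),
  City      = c("Pune", "Pimpri", "Indore"),
  Latitude  = c(18.52, 18.62, 22.72),
  Longitude = c(73.86, 73.80, 75.86)
)

# Collapse to one row per district so each match joins to at most one row.
district_coords <- toy_cities %>%
  group_by(District) %>%
  summarise(Latitude  = mean(Latitude),
            Longitude = mean(Longitude),
            .groups   = "drop")

toy_joined <- toy_matches %>%
  left_join(district_coords, by = c("city" = "District"))

nrow(toy_joined) == nrow(toy_matches)  # the join no longer adds rows
```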

What story could you tell with your plots? What difficulties did you encounter while creating the visualizations? What additional approaches do you think can be used to explore the data you selected?

With the plotted map, one could tell the story of where IPL matches were held across India. This visualization provides insights into the geographical distribution of matches, highlighting the cities where IPL teams frequently play. Difficulties included mismatched city names between the datasets and missing coordinates: 119 joined rows had missing or invalid latitude/longitude values and were left off the map, and the join produced a many-to-many warning because some districts appear more than once in the cities table. Additional approaches to explore the data could involve clustering analysis to identify regions with high match density or overlaying demographic data to understand the audience reach of IPL matches.
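Mismatched names can be listed before mapping with an anti-join, which keeps only the match rows that find no partner in the cities table. The sketch below uses invented toy tables; the spelling mismatch in it is a hypothetical example, not a finding from the real data.

```r
library(dplyr)

# Toy stand-ins for `matches` and `indian_cities`; values are illustrative.
toy_matches <- data.frame(
  id   = 1:4,
  city = c("Mumbai", "Bengaluru", NA, "Mumbai")
)
toy_cities <- data.frame(District = c("Mumbai", "Bangalore"))

# Rows of toy_matches with no matching District, counted per city name.
unmatched <- toy_matches %>%
  anti_join(toy_cities, by = c("city" = "District")) %>%
  count(city, name = "n_matches")

unmatched
```

The resulting list shows which names need recoding (or which rows have no city at all) before the coordinates can be attached.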

How did you apply the principles of data visualizations and design for this assignment?

The principles of data visualization and design were applied by choosing an appropriate visualization method (spatial visualization using a map), ensuring clarity in the plotted data points, and providing interactivity for users to explore match details. The map provides a clear representation of match locations, allowing viewers to easily understand the geographical distribution of IPL matches in India. Additionally, popup markers with venue details enhance the user experience by providing additional information upon interaction with the map.

Visualization of an ML model

colSums(is.na(matches))
##              id          season            city            date           team1 
##               0               0               7               0               0 
##           team2     toss_winner   toss_decision          result      dl_applied 
##               0               0               0               0               0 
##          winner     win_by_runs  win_by_wickets player_of_match           venue 
##               4               0               0               4               0 
##         umpire1         umpire2         umpire3 
##               2               2             637
df_clean <- na.omit(matches)
str(df_clean)
## tibble [118 × 18] (S3: tbl_df/tbl/data.frame)
##  $ id             : num [1:118] 7894 7895 7896 7897 7898 ...
##  $ season         : num [1:118] 2018 2018 2018 2018 2018 ...
##  $ city           : chr [1:118] "Mumbai" "Mohali" "Kolkata" "Hyderabad" ...
##  $ date           : chr [1:118] "07/04/18" "08/04/18" "08/04/18" "09/04/18" ...
##  $ team1          : chr [1:118] "Mumbai Indians" "Delhi Daredevils" "Royal Challengers Bangalore" "Rajasthan Royals" ...
##  $ team2          : chr [1:118] "Chennai Super Kings" "Kings XI Punjab" "Kolkata Knight Riders" "Sunrisers Hyderabad" ...
##  $ toss_winner    : chr [1:118] "Chennai Super Kings" "Kings XI Punjab" "Kolkata Knight Riders" "Sunrisers Hyderabad" ...
##  $ toss_decision  : chr [1:118] "field" "field" "field" "field" ...
##  $ result         : chr [1:118] "normal" "normal" "normal" "normal" ...
##  $ dl_applied     : num [1:118] 0 0 0 0 0 1 0 0 0 0 ...
##  $ winner         : chr [1:118] "Chennai Super Kings" "Kings XI Punjab" "Kolkata Knight Riders" "Sunrisers Hyderabad" ...
##  $ win_by_runs    : num [1:118] 0 0 0 0 0 10 0 0 0 0 ...
##  $ win_by_wickets : num [1:118] 1 6 4 9 5 0 1 4 7 5 ...
##  $ player_of_match: chr [1:118] "DJ Bravo" "KL Rahul" "SP Narine" "S Dhawan" ...
##  $ venue          : chr [1:118] "Wankhede Stadium" "Punjab Cricket Association IS Bindra Stadium, Mohali" "Eden Gardens" "Rajiv Gandhi International Stadium, Uppal" ...
##  $ umpire1        : chr [1:118] "Chris Gaffaney" "Rod Tucker" "C Shamshuddin" "Nigel Llong" ...
##  $ umpire2        : chr [1:118] "A Nanda Kishore" "K Ananthapadmanabhan" "A.D Deshmukh" "Vineet Kulkarni" ...
##  $ umpire3        : chr [1:118] "Anil Chaudhary" "Nitin Menon" "S Ravi" "O Nandan" ...
##  - attr(*, "na.action")= 'omit' Named int [1:638] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:638] "1" "2" "3" "4" ...
# Assuming you have already loaded the 'matches' dataset

# Fit a linear regression model
lm_model <- lm(win_by_runs ~ win_by_wickets + season, data = df_clean)

# Summary of the linear regression model
summary(lm_model)
## 
## Call:
## lm(formula = win_by_runs ~ win_by_wickets + season, data = df_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.864  -9.396  -2.291   4.900  94.136 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2939.5158  6610.9573  -0.445    0.657    
## win_by_wickets    -3.5955     0.5085  -7.070 1.28e-10 ***
## season             1.4677     3.2752   0.448    0.655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.76 on 115 degrees of freedom
## Multiple R-squared:  0.303,  Adjusted R-squared:  0.2909 
## F-statistic:    25 on 2 and 115 DF,  p-value: 9.677e-10
# Diagnostic plots for the fitted model (residuals vs fitted, Q-Q, scale-location, residuals vs leverage)
par(mfrow=c(1,1))
plot(lm_model)

What were the original charts you planned to create for this assignment? What steps were necessary for cleaning and preparing the data?

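The original plan was to fit a linear regression model predicting win_by_runs from win_by_wickets and season. This required checking for missing values with colSums(is.na(matches)), removing incomplete rows with na.omit(), and ensuring the variables were in the correct format. Note that na.omit() left only 118 of the 756 matches, mainly because umpire3 is missing for 637 of them.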
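Since `na.omit()` drops a row if any column is missing, a sparse column that the model never uses can remove most of the data. A minimal sketch of the alternative, selecting only the model's columns before dropping incomplete rows, is given below; `toy_matches` and its values are invented for illustration.

```r
# Toy stand-in for `matches`; values are invented for illustration only.
toy_matches <- data.frame(
  win_by_runs    = c(35, 0, 0, 15),
  win_by_wickets = c(0, 7, 10, 0),
  season         = c(2017, 2017, 2018, 2018),
  umpire3        = c(NA, NA, "S Ravi", NA)
)

# Dropping incomplete rows across every column keeps only one row here ...
all_cols <- na.omit(toy_matches)

# ... while restricting to the columns the model uses keeps all four.
model_vars <- c("win_by_runs", "win_by_wickets", "season")
model_cols <- na.omit(toy_matches[, model_vars])

c(all_columns = nrow(all_cols), model_columns = nrow(model_cols))
```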

What story could you tell with your plots? What difficulties did you encounter while creating the visualizations? What additional approaches do you think can be used to explore the data you selected?

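With these plots, you can tell the story of how win_by_wickets and season relate to win_by_runs. The summary gives the estimated effect of each predictor and the statistical details of the model’s fit: only win_by_wickets is significant, and the adjusted R-squared is about 0.29. The plots produced by plot(lm_model) are diagnostic plots, such as residuals vs fitted, rather than plots of the coefficients. The negative coefficient for win_by_wickets should be read with care, because a match is won either by runs or by wickets, so at least one of the two margins is always zero. One difficulty was understanding how to interpret the residuals vs fitted plot; another is that coefficients are hard to compare when the variables are not on the same scale, which normalizing or standardizing the variables could help with. Additional approaches could include adding interaction terms between win_by_wickets and season to capture any combined effects.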
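Standardizing the predictors and adding an interaction term can both be done directly in the model formula workflow. The sketch below uses simulated data, so the fitted numbers carry no meaning for the IPL data; the names `toy`, `wickets_z`, and `season_z` are hypothetical.

```r
set.seed(1)

# Simulated stand-in data; the relationship below is made up for illustration.
toy <- data.frame(
  wickets = sample(0:10, 50, replace = TRUE),
  season  = sample(2008:2019, 50, replace = TRUE)
)
toy$runs <- rnorm(50, mean = 20 - 2 * toy$wickets, sd = 5)

# scale() centres each predictor at 0 with standard deviation 1,
# so the coefficients are on a comparable scale.
toy$wickets_z <- as.numeric(scale(toy$wickets))
toy$season_z  <- as.numeric(scale(toy$season))

# `a * b` expands to both main effects plus their interaction `a:b`.
fit <- lm(runs ~ wickets_z * season_z, data = toy)
names(coef(fit))
```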

How did you apply the principles of data visualizations and design for this assignment?

The plot() and summary() functions give a concise visual and numerical summary of the linear regression model. Calling plot() on the fitted model already produces the standard diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage), which help check the model assumptions. More visualization techniques could still be explored, such as a plot of predicted against actual values, to further assess the model's performance.
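The diagnostic plots are easier to compare when drawn together in a 2 x 2 grid instead of one per page. A minimal sketch follows, using R's built-in `cars` data as a stand-in for the fitted model and a temporary PDF as the output; both choices are assumptions made so the example runs on its own.

```r
# Stand-in model on the built-in `cars` data (not the IPL model).
fit <- lm(dist ~ speed, data = cars)

# Write the four diagnostic plots to a single page of a PDF file.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 8, height = 8)
par(mfrow = c(2, 2))  # 2 x 2 grid on the current device
plot(fit)             # residuals vs fitted, Q-Q, scale-location, leverage
dev.off()

out_file
```

In an interactive session the `pdf()` and `dev.off()` calls can be omitted so the grid is drawn in the plot window.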

# Convert necessary variables to factors

df_clean$winner <- as.factor(df_clean$winner)
df_clean$city <- as.factor(df_clean$city)
df_clean$team1 <- as.factor(df_clean$team1)
df_clean$team2 <- as.factor(df_clean$team2)
df_clean$toss_winner <- as.factor(df_clean$toss_winner)
df_clean$toss_decision <- as.factor(df_clean$toss_decision)
df_clean$venue <- as.factor(df_clean$venue)
df_clean$umpire1 <- as.factor(df_clean$umpire1)
df_clean$umpire2 <- as.factor(df_clean$umpire2)
df_clean$umpire3 <- as.factor(df_clean$umpire3)
# Split the data into training and testing sets
set.seed(123) # for reproducibility
train_index <- sample(1:nrow(df_clean), 0.7 * nrow(df_clean))
train_data <- df_clean[train_index, ]
test_data <- df_clean[-train_index, ]

# Train the Random Forest model
rf_model <- randomForest(winner ~ ., data = train_data)

# Predict the winner
predictions <- predict(rf_model, newdata = test_data)

# Evaluate the model
conf_matrix <- confusionMatrix(predictions, test_data$winner)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##                              Reference
## Prediction                    Chennai Super Kings Delhi Capitals
##   Chennai Super Kings                           4              0
##   Delhi Capitals                                1              0
##   Delhi Daredevils                              0              0
##   Kings XI Punjab                               0              0
##   Kolkata Knight Riders                         0              0
##   Mumbai Indians                                1              0
##   Rajasthan Royals                              0              0
##   Royal Challengers Bangalore                   1              0
##   Sunrisers Hyderabad                           1              1
##                              Reference
## Prediction                    Delhi Daredevils Kings XI Punjab
##   Chennai Super Kings                        0               1
##   Delhi Capitals                             0               0
##   Delhi Daredevils                           0               0
##   Kings XI Punjab                            0               2
##   Kolkata Knight Riders                      0               0
##   Mumbai Indians                             0               0
##   Rajasthan Royals                           0               1
##   Royal Challengers Bangalore                0               0
##   Sunrisers Hyderabad                        0               1
##                              Reference
## Prediction                    Kolkata Knight Riders Mumbai Indians
##   Chennai Super Kings                             1              2
##   Delhi Capitals                                  0              0
##   Delhi Daredevils                                0              0
##   Kings XI Punjab                                 0              0
##   Kolkata Knight Riders                           3              0
##   Mumbai Indians                                  0              2
##   Rajasthan Royals                                0              0
##   Royal Challengers Bangalore                     2              0
##   Sunrisers Hyderabad                             0              0
##                              Reference
## Prediction                    Rajasthan Royals Royal Challengers Bangalore
##   Chennai Super Kings                        0                           0
##   Delhi Capitals                             0                           0
##   Delhi Daredevils                           0                           0
##   Kings XI Punjab                            0                           1
##   Kolkata Knight Riders                      0                           0
##   Mumbai Indians                             0                           0
##   Rajasthan Royals                           3                           0
##   Royal Challengers Bangalore                0                           2
##   Sunrisers Hyderabad                        1                           1
##                              Reference
## Prediction                    Sunrisers Hyderabad
##   Chennai Super Kings                           0
##   Delhi Capitals                                0
##   Delhi Daredevils                              0
##   Kings XI Punjab                               0
##   Kolkata Knight Riders                         1
##   Mumbai Indians                                0
##   Rajasthan Royals                              1
##   Royal Challengers Bangalore                   0
##   Sunrisers Hyderabad                           2
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.3292, 0.6708)
##     No Information Rate : 0.2222          
##     P-Value [Acc > NIR] : 0.0002328       
##                                           
##                   Kappa : 0.4173          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Chennai Super Kings Class: Delhi Capitals
## Sensitivity                              0.5000               0.00000
## Specificity                              0.8571               0.97143
## Pos Pred Value                           0.5000               0.00000
## Neg Pred Value                           0.8571               0.97143
## Prevalence                               0.2222               0.02778
## Detection Rate                           0.1111               0.00000
## Detection Prevalence                     0.2222               0.02778
## Balanced Accuracy                        0.6786               0.48571
##                      Class: Delhi Daredevils Class: Kings XI Punjab
## Sensitivity                               NA                0.40000
## Specificity                                1                0.96774
## Pos Pred Value                            NA                0.66667
## Neg Pred Value                            NA                0.90909
## Prevalence                                 0                0.13889
## Detection Rate                             0                0.05556
## Detection Prevalence                       0                0.08333
## Balanced Accuracy                         NA                0.68387
##                      Class: Kolkata Knight Riders Class: Mumbai Indians
## Sensitivity                               0.50000               0.50000
## Specificity                               0.96667               0.96875
## Pos Pred Value                            0.75000               0.66667
## Neg Pred Value                            0.90625               0.93939
## Prevalence                                0.16667               0.11111
## Detection Rate                            0.08333               0.05556
## Detection Prevalence                      0.11111               0.08333
## Balanced Accuracy                         0.73333               0.73438
##                      Class: Rajasthan Royals Class: Royal Challengers Bangalore
## Sensitivity                          0.75000                            0.50000
## Specificity                          0.93750                            0.90625
## Pos Pred Value                       0.60000                            0.40000
## Neg Pred Value                       0.96774                            0.93548
## Prevalence                           0.11111                            0.11111
## Detection Rate                       0.08333                            0.05556
## Detection Prevalence                 0.13889                            0.13889
## Balanced Accuracy                    0.84375                            0.70312
##                      Class: Sunrisers Hyderabad
## Sensitivity                             0.50000
## Specificity                             0.84375
## Pos Pred Value                          0.28571
## Neg Pred Value                          0.93103
## Prevalence                              0.11111
## Detection Rate                          0.05556
## Detection Prevalence                    0.19444
## Balanced Accuracy                       0.67188
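The per-class statistics above can be derived directly from the confusion matrix counts. The following is a minimal sketch using a hypothetical 3-class matrix (the class names A, B, C and the counts are toy values, not the IPL results), with rows as predictions and columns as reference labels, matching the layout printed above.

```r
# Hypothetical 3x3 confusion matrix (toy counts, not the IPL data):
# rows = prediction, columns = reference
cm <- matrix(c(4, 1, 0,
               2, 3, 1,
               0, 1, 6),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("A", "B", "C"),
                             Reference  = c("A", "B", "C")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n   # share of correct predictions
nir <- max(colSums(cm)) / n     # no-information rate: share of the largest class

tp <- diag(cm)                  # correctly predicted members of each class
fn <- colSums(cm) - tp          # class members predicted as another class
fp <- rowSums(cm) - tp          # other classes predicted as this class
tn <- n - tp - fn - fp          # everything else

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

round(rbind(sensitivity, specificity, balanced_accuracy), 4)
```

When a class has no reference cases, tp + fn is zero and the sensitivity is undefined, which is consistent with the NA values reported for Delhi Daredevils above.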
# Plot the ROC curve
roc_curve <- roc(test_data$winner, as.numeric(predictions))
## Warning in roc.default(test_data$winner, as.numeric(predictions)): 'response'
## has more than two levels. Consider setting 'levels' explicitly or using
## 'multiclass.roc' instead
## Setting levels: control = Chennai Super Kings, case = Delhi Capitals
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve for Random Forest Model")
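The warning above arises because roc() expects a two-level response. One possible alternative is a one-vs-rest approach: draw one binary ROC curve per class from class probabilities. The sketch below uses simulated toy data with hypothetical class names (Team A, Team B, Team C); in this analysis the probabilities would presumably come from calling predict() with type = "prob" on the fitted random forest, which is an assumption about the earlier code.

```r
library(pROC)

set.seed(1)
# Toy data (hypothetical): 3 classes with 20 observations each
classes <- c("Team A", "Team B", "Team C")
truth <- factor(rep(classes, each = 20), levels = classes)
n <- length(truth)

# Simulated class scores: uniform noise plus a bump for the true class
scores <- matrix(runif(n * 3), nrow = n, dimnames = list(NULL, classes))
hit <- cbind(seq_len(n), as.integer(truth))
scores[hit] <- scores[hit] + 0.5
prob <- scores / rowSums(scores)   # normalise rows so they sum to 1

# One-vs-rest: one binary ROC curve per class (1 = this class, 0 = any other)
ovr <- lapply(classes, function(cl) {
  roc(response = as.integer(truth == cl), predictor = prob[, cl],
      levels = c(0, 1), direction = "<", quiet = TRUE)
})
names(ovr) <- classes
aucs <- sapply(ovr, function(r) as.numeric(auc(r)))

# Overlay the curves on one plot
plot(ovr[[1]], col = 1, main = "One-vs-rest ROC curves (toy data)")
for (i in 2:3) lines(ovr[[i]], col = i)
legend("bottomright",
       legend = sprintf("%s (AUC = %.2f)", classes, aucs),
       col = 1:3, lwd = 2)
```

Each curve answers a binary question (this team versus all others), so the thresholds and the AUC are well defined for every class.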

What were the original charts you planned to create for this assignment? What steps were necessary for cleaning and preparing the data?

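The confusion matrix helps evaluate the classification model by showing, for each team, the counts of true positives, true negatives, false positives, and false negatives. The ROC plot illustrates the trade-off between sensitivity (true positive rate) and specificity (true negative rate) across classification thresholds. Because winner has more than two levels, roc() compared only the first two (Chennai Super Kings as control, Delhi Capitals as case), as the warning shows, so the curve covers that pair rather than all nine teams.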

What story could you tell with your plots? What difficulties did you encounter while creating the visualizations? What additional approaches do you think can be used to explore the data you selected?

These results show how well the model predicts the winning team of a match. Overall accuracy is 0.5 against a no-information rate of 0.2222, and the confusion matrix shows where the model makes errors (false positives and false negatives for each team). The ROC curve gives a visual representation of the model’s ability to separate classes: the closer the curve is to the top-left corner, the better the model’s performance. One difficulty is handling imbalanced classes when some teams dominate the dataset; for example, Delhi Daredevils has no matches in the test set, so its sensitivity is reported as NA. Additional approaches could involve feature engineering to create new variables that better capture the dynamics of cricket matches, such as player statistics or team rankings.
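Class imbalance can be inspected before modelling, and a stratified split keeps every class represented in the test set. The following is a minimal base R sketch on a hypothetical winner factor (the team names and counts are invented for illustration and are not taken from the dataset).

```r
set.seed(123)
# Hypothetical, deliberately imbalanced response (toy counts)
winner <- factor(rep(c("Team A", "Team B", "Team C"), times = c(50, 30, 5)))

table(winner)                        # absolute class counts
round(prop.table(table(winner)), 3)  # class shares

# Stratified 80/20 split: sample about 20% of the row indices within each class,
# keeping at least one test case per class.
# Note: sample(i, ...) assumes each class has more than one row.
test_idx <- unlist(lapply(split(seq_along(winner), winner), function(i) {
  sample(i, size = max(1, round(0.2 * length(i))))
}))

table(winner[test_idx])              # every class appears in the test set
```

With a split like this, no class ends up with zero prevalence in the test set, so per-class sensitivity can be computed for every team.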

How did you apply the principles of data visualization and design for this assignment?

In data visualization, clarity and interpretability are key. Ensure that your plots are easy to understand and effectively convey the model’s performance. Use appropriate labels, titles, and annotations to make your plots informative. Consider the audience and tailor the visualizations accordingly. For cricket enthusiasts, you might delve deeper into match-specific insights, while for a general audience, you might focus on overall model performance.